Abstract

A deep neural network is a powerful framework for learning representations.
Usually, it is used to learn the relation x → y by exploiting the regularities in the input
x, without considering the output representation y. In structured output prediction
problems, where the output is multi-dimensional and where structural relations exist
between the dimensions, the network tends to overfit when the training data
are limited. To overcome this issue and reduce the amount of data required to
output accurate predictions, we propose in this paper a regularization scheme for training
neural networks on these particular tasks. Our scheme incorporates
the learning of the output representation y into the training process while learning the
mapping function x → y. Our proposition is a multi-task framework containing two
unsupervised tasks over the input and the output data along with the supervised task.

We also experiment with the use of the output labels y without their corresponding inputs x.

We evaluate our framework on a facial landmark detection problem, which is a typical
structured output prediction task. We show on two challenging public datasets (LFPW
and HELEN) that our regularization scheme improves the generalization of deep neural
networks and accelerates their training. The use of unlabeled data is also explored, showing
an additional improvement of the results. We provide an open-source implementation
of our framework.

Keywords: Structured output prediction, representation learning, regularization, deep
learning, multi-task learning

1 Introduction


In machine learning, the main task usually consists in learning general regularities over
the input space in order to provide a specific output. Most machine learning applications
aim at predicting a single value: a label for classification or a scalar value for regression. Many
recent applications address challenging problems where the output lies in a multi-dimensional
space describing discrete or continuous variables that are most of the time interdependent. A
typical example is speech recognition, where the output label is a sequence of characters which
are interdependent, following the statistics of the considered language. These dependencies
generally constitute a regular structure such as a sequence, a string, a tree or a graph. As it
provides constraints that may help the prediction, this structure should be either discovered
if unknown, or integrated in the learning algorithm using prior assumptions. The range of
applications that deal with structured output data is large. One can cite, among others, image
labeling [12, 23, 28, 32], statistical natural language processing (NLP) [30, 35, 34], bioinformatics
[16, 39], speech processing [31, 43] and handwriting recognition [15, 36]. Another example, which
is considered in the evaluation of our proposal in this paper, is the facial landmark detection
problem. The task consists in predicting the coordinates of a set of keypoints given the face
image as input (Fig. 1). The points are interdependent through geometric relations
induced by the face structure. Therefore, facial landmark detection can be considered as a
structured output prediction task.



Figure 1: Examples of facial landmarks from LFPW [4] training set. 

One main difficulty in structured output prediction is the exponential number of possible
configurations of the output space. From a statistical point of view, learning to accurately
predict high-dimensional vectors requires a large amount of data, whereas in practice we usually
have limited data. In this article we propose to consider structured output prediction as a
representation learning problem, where the model must i) capture the discriminative relation
between x (input) and y (output), and ii) capture the interdependencies lying between the
variables of each space by efficiently modeling the input and output distributions. We address
this modeling through a regularization scheme for training neural networks. This framework
incorporates the learning of the output representation y into the training process through
an unsupervised task, while learning the function x → y and the input representation x.

Our main contribution is a multi-task framework dedicated to training models for structured output
prediction. We propose to combine unsupervised tasks over the input and output data
concurrently with the supervised task. This parallelism can be seen as a regularization of the
supervised task. Moreover, as a second contribution, we demonstrate experimentally the benefit
of using the output labels y without their corresponding inputs x. In this work, the multi-task
framework is instantiated using auto-encoders [42, 5] for both representation learning and
exploiting unlabeled data (input) and label-only data (output). We demonstrate the efficiency
of our proposal on a real-world facial landmark detection problem.

The rest of the paper is organized as follows. Related work on structured output prediction
is reviewed in Section 2. Section 3 presents the proposed formulation and its optimization
details. Section 4 describes the instantiation of the formulation using a deep neural network.
Finally, Section 5 details the conducted experiments, including the datasets, the evaluation
metrics and the general training setup. Two types of experiments are explored: with and without
the use of unlabeled data. Results are presented and discussed for both cases.

2 Related work 

We distinguish two main categories of methods for structured output prediction. For a long
time, graphical models have shown great success in different applications involving 1D and
2D signals. Recently, a new trend has emerged based on deep neural networks.

2.1 Graphical Models Approaches 

Historically, graphical models are well known to be suitable for learning structures. One of
their main strengths is the easy integration of explicit structural constraints and prior knowledge
directly into the model's structure. They have shown great success in modeling structured
data thanks to their capacity to capture dependencies among relevant random variables. For
instance, the Hidden Markov Model (HMM) framework has been very successful in modeling sequence
data. HMMs assume that the output random variables are independent,
which is not the case in many real-world applications where strong relations are
present. Conditional Random Fields (CRF) have been proposed to overcome this issue, thanks
to their capability to learn large dependencies of the observed output data. These two frameworks
are widely used to model structured output data represented as a 1-D sequence [11, 31, 6, 19].
Many approaches have also been proposed to deal with 2-D structured output data as an
extension of HMM and CRF. [26] propose a Markov Random Field (MRF) for document image
segmentation. [40] provide an adaptation of CRF to 2-D signals for the interpretation of
hand-drawn diagrams. Another extension of CRF to 3-D signals is presented in [41] for 3-D medical
image segmentation. Despite the large success of graphical models in many domains, they still
encounter some difficulties. For instance, due to the computational cost of inference, graphical
models are limited to low-dimensional structured output problems. Furthermore, HMM
and CRF models are generally used with discrete output data, and few works address the
regression problem [29, 13].

2.2 Deep Neural Networks Approaches 

More recently, deep learning based approaches have been widely used to solve structured output
prediction problems, especially image labeling. Deep learning provides
many different architectures. Therefore, different solutions have been proposed depending on the
application at hand and the expected result.

In the image labeling task (also known as semantic segmentation), one needs models able to
adapt to the large variations in the input image. Given their large success in image processing
related tasks [18], convolutional neural networks are a natural choice. Therefore, they have been
used as the core model in image labeling problems in order to learn the relevant features.
They have been used either combined with simple post-processing in order to calibrate the
output [8] or with more sophisticated structure modeling approaches such as CRF [12] or energy
based models [27]. Recently, a new trend has emerged, based on the application of convolutional
[23, 32] or deconvolutional [28] layers at the output of the network. These models go by the name
of fully convolutional networks and have shown successful results in image labeling. Despite this
success, these models do not take the output representation into consideration.

In many applications, it is not enough to provide the output prediction; its probability
is also needed. In this case, Conditional Restricted Boltzmann Machines, a particular case of neural
networks and probabilistic graphical models, have been used with different training algorithms
according to the size of the plausible output configurations [25]. Training and inference
with such models remain difficult tasks. In this same direction, [2] tackle structured output
problems as an energy minimization through two feed-forward networks. The first is used for
feature extraction over the input. The second is used for estimating an energy by taking as
input the extracted features and the current state of the output labels. This allows learning
the interdependencies within the output labels. The prediction is performed using an iterative
backpropagation-based method with respect to the labels through the second network, which
remains computationally expensive. Similarly, Recurrent Neural Networks (RNN) are a particular
architecture of neural networks. They have shown great success in modeling sequence
data and outputting sequence probabilities for applications such as Natural Language Processing
(NLP) tasks [22, 38, 1] and speech recognition [14]. They have also been used for image captioning
[17]. However, RNN models do not explicitly consider the output dependencies.

In [21], our team proposed the use of auto-encoders in order to learn the output distribution
in a pre-training fashion, with a promising application to image labeling. The
approach consists of two sequential steps. First, an input and output pre-training is performed
in an unsupervised way using auto-encoders. Then, the whole network is fine-tuned
using supervised data. While this approach allows incorporating prior knowledge about the
output distribution, it has two main issues. First, the alteration of a network output layer
is critical and must be performed carefully. Moreover, one needs to perform multiple
trial-and-error loops in order to set the auto-encoder's training hyper-parameters. The second issue is
overfitting. When pre-training the output auto-encoder, there is actually no information that
indicates whether the pre-training is helping the supervised task, nor when to stop the pre-training.

The present work proposes a general and easy to use multi-task training framework for
structured output prediction models. The input and the output unsupervised tasks are embedded
into a regularization scheme and learned in parallel with the supervised task. This parallel
transfer learning, which includes an output reconstruction task, constitutes the main contribution
of this work. We also show that the proposed framework makes it possible to use labels without
inputs in an unsupervised fashion, and we study its effect on the generalization of the model. This can be
very useful in applications where the output data are abundant, such as a speech recognition
task where the output is ASCII text, which can easily be gathered from the Internet. In this article,
we validate our proposal on a facial landmark prediction problem over two challenging public
datasets (LFPW and HELEN). The performed experiments show an improvement of the
generalization of deep neural networks and an acceleration of their training.

3 Multi-task Training Framework for Structured Output Prediction

Let us consider a training set $\mathcal{D}$ containing examples with both features and targets $(x, y)$,
features without targets $(x, \_)$, and targets without features $(\_, y)$. Let us consider a set $\mathcal{F}$ which
is the subset of $\mathcal{D}$ containing examples with at least features $x$, a set $\mathcal{L}$ which is the subset of
$\mathcal{D}$ containing examples with at least targets $y$, and a set $\mathcal{S}$ which is the subset of $\mathcal{D}$ containing
examples with both features $x$ and targets $y$. One can note that all examples in $\mathcal{S}$ are also in
$\mathcal{F}$ and in $\mathcal{L}$.

Input task The input task $\mathcal{R}_{in}$ is an unsupervised reconstruction task which assists the main
task. The role of this task is to project the input data $x$ into an intermediate space $\tilde{x}$
through a function $\mathcal{P}_{in}$ and to estimate a reconstruction $\hat{x}$ of it through a function $\mathcal{P}'_{in}$:

$$\hat{x} = \mathcal{R}_{in}(x; w_{in}) = \mathcal{P}'_{in}\big(\tilde{x} = \mathcal{P}_{in}(x; w_{cin}); w_{din}\big), \tag{1}$$

where $w_{in} = \{w_{cin}, w_{din}\}$. The reconstruction parameters $w_{din}$ are proper to this task
and the projection parameters $w_{cin}$ are shared with the main task (see Fig. 2).

The training criterion for this task is given by:

$$J_{in}(\mathcal{F}; w_{in}) = \frac{1}{\mathrm{card}\,\mathcal{F}} \sum_{x \in \mathcal{F}} C_{in}\big(\mathcal{R}_{in}(x; w_{in}), x\big), \tag{2}$$

where $C_{in}$ is an unsupervised learning cost which can be computed on all the samples
with features (i.e. on $\mathcal{F}$).

Output task The output task $\mathcal{R}_{out}$ is an unsupervised reconstruction task which also assists
the main task. Similarly, the role of this task is to project the output data $y$ into an
intermediate space $\tilde{y}$ through a function $\mathcal{P}_{out}$, and to estimate a reconstruction $\hat{y}$ of it
through a function $\mathcal{P}'_{out}$:

$$\hat{y} = \mathcal{R}_{out}(y; w_{out}) = \mathcal{P}'_{out}\big(\tilde{y} = \mathcal{P}_{out}(y; w_{cout}); w_{dout}\big), \tag{3}$$

where $w_{out} = \{w_{cout}, w_{dout}\}$. Contrary to the input task, the projection parameters
$w_{cout}$ are proper to this task and the reconstruction parameters $w_{dout}$ are this time shared
with the main task (see Fig. 2).

The training criterion for this task is given by:

$$J_{out}(\mathcal{L}; w_{out}) = \frac{1}{\mathrm{card}\,\mathcal{L}} \sum_{y \in \mathcal{L}} C_{out}\big(\mathcal{R}_{out}(y; w_{out}), y\big), \tag{4}$$

where $C_{out}$ is an unsupervised learning cost which can be computed on all the samples
with labels (i.e. on $\mathcal{L}$).


Main task The main task is a supervised task that attempts to learn the mapping function
$\mathcal{M}$ between features $x$ and labels $y$. In order to do so, the first part of the mapping
function is shared with the projection part $\mathcal{P}_{in}$ of the input task and the last part is
shared with the reconstruction part $\mathcal{P}'_{out}$ of the output task. The middle part $m$ of the
mapping function $\mathcal{M}$ is specific to this task:

$$\hat{y} = \mathcal{M}(x; w_{sup}) = \mathcal{P}'_{out}\big(m(\mathcal{P}_{in}(x; w_{cin}); w_s); w_{dout}\big), \tag{5}$$

where $w_{sup} = \{w_{cin}, w_s, w_{dout}\}$. Accordingly, the $w_{cin}$ and $w_{dout}$ parameters are respectively
shared with the input and output tasks.

Learning this task consists in minimizing its learning criterion $J_s$:

$$J_s(\mathcal{S}; w_{sup}) = \frac{1}{\mathrm{card}\,\mathcal{S}} \sum_{(x, y) \in \mathcal{S}} C_s\big(\mathcal{M}(x; w_{sup}), y\big). \tag{6}$$

Figure 2: Proposed MTL framework. Black plain arrows stand for intermediate functions, the blue
dotted arrow for the input auxiliary task $\mathcal{R}_{in}$, the green dashed arrow for the output auxiliary task $\mathcal{R}_{out}$,
and the red dash-dotted arrow for the main supervised task $\mathcal{M}$.

In summary, our proposal is formulated as a multi-task learning (MTL) framework [7],
which gathers a main task and two secondary tasks. This framework is illustrated in Fig. 2.

Learning the three tasks is performed in parallel. This can be translated in terms of training
cost as the sum of the corresponding costs. Given that the tasks have different importance, we
weight each cost using a corresponding importance weight $\lambda_{sup}$, $\lambda_{in}$ and $\lambda_{out}$, respectively for
the supervised, the input and the output tasks. Therefore, the full objective of our framework can
be written as:

$$J(\mathcal{D}; w) = \lambda_{sup} \cdot J_s(\mathcal{S}; w_{sup}) + \lambda_{in} \cdot J_{in}(\mathcal{F}; w_{in}) + \lambda_{out} \cdot J_{out}(\mathcal{L}; w_{out}), \tag{7}$$

where $w = \{w_{cin}, w_{din}, w_s, w_{cout}, w_{dout}\} = \{w_{din}, w_{sup}, w_{cout}\}$ is the complete
set of parameters of the framework.

Instead of using fixed importance weights, which can be difficult to set optimally, we evolve
them through the learning epochs. In this context, Eq. 7 is modified as follows:

$$J(\mathcal{D}; w) = \lambda_{sup}(t) \cdot J_s(\mathcal{S}; w_{sup}) + \lambda_{in}(t) \cdot J_{in}(\mathcal{F}; w_{in}) + \lambda_{out}(t) \cdot J_{out}(\mathcal{L}; w_{out}), \tag{8}$$

where $t \geq 0$ indicates the learning epoch.

The main advantage of Eq.8 is that it allows an interaction between the main supervised 
task and the output task. This interaction will help to prevent the output task from overfitting. 
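
To make the weighted objective concrete, the following minimal Python sketch evaluates the form of Eq. 8 for one epoch, assuming the three task costs have already been averaged over their respective sets. The function names and the squared-error sample cost are illustrative assumptions; the formulation above leaves the per-sample costs generic.

```python
def squared_error(pred, target):
    """Hypothetical per-sample cost (an assumption; the text keeps the costs generic)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def mean_cost(cost_fn, pairs):
    """Average a per-sample cost over (prediction, target) pairs."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one (prediction, target) pair is required")
    return sum(cost_fn(p, t) for p, t in pairs) / len(pairs)


def combined_cost(j_sup, j_in, j_out, weights):
    """Weighted sum of the supervised, input and output costs for one epoch."""
    lam_sup, lam_in, lam_out = weights
    return lam_sup * j_sup + lam_in * j_in + lam_out * j_out


# Example with made-up cost values and importance weights for a given epoch.
j_sup = mean_cost(squared_error, [([0.0, 1.0], [0.0, 0.0])])
total = combined_cost(j_sup, 2.0, 3.0, (0.5, 0.25, 0.25))
```

Because the weights are passed in per call, the same function covers both the fixed-weight case and the epoch-dependent case.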

4 Implementation 

In this work, we implement our framework through a deep neural network. The main
supervised task is performed using a deep neural network (DNN) with $K$ layers. Secondary
reconstruction tasks are carried out by auto-encoders (AE): the input task is achieved using
an AE that has $K_{in}$ layers in its encoding part, with an encoded representation of the same
dimension as $\tilde{x}$. Similarly, the output task is achieved using an AE that has $K_{out}$ layers in
its decoding part, with an encoded representation of the same dimension as $\tilde{y}$. At least one
layer must be dedicated in the DNN to link $\tilde{x}$ and $\tilde{y}$ in the intermediate spaces. Therefore,
$K_{in} + K_{out} < K$.

Parameters $w_{in}$ are the parameters of the whole input AE, $w_{out}$ are the parameters of the
whole output AE and $w_{sup}$ are the parameters of the main neural network (NN). The encoding
layers of the input AE are tied to the first layers of the main NN, and the decoding layers of
the output AE are in turn tied to the last layers of the main NN. If $w_i$ are the parameters of
layer $i$ of a neural network, then the $w_1$ to $w_{K_{in}}$ parameters of the input AE are shared with the $w_1$
to $w_{K_{in}}$ parameters of the main NN. Moreover, if $w_{-i}$ are the parameters of the last minus $i - 1$
layer of a neural network, then the parameters $w_{-1}$ to $w_{-K_{out}}$ of the output AE are shared with
the parameters $w_{-1}$ to $w_{-K_{out}}$ of the main NN.

During training, the loss function of the input AE is used as $C_{in}$, the loss function of the
output AE is used as $C_{out}$, and the loss function of the main NN is used as $C_s$.

Optimizing Eq. 8 can be performed using Stochastic Gradient Descent. In the case of task
combination, one way to perform the optimization is to alternate between the tasks when needed
[9, 45]. In the case where the training set does not contain unlabeled data, the optimization of
Eq. 8 can be done in parallel over all the tasks. When using unlabeled data, the gradient for
the whole cost cannot be computed at once. Therefore, we need to split the gradient for each
sub-cost according to the nature of the samples in each mini-batch. For the sake of clarity,
we illustrate our optimization scheme in Algorithm 1 using on-line training (i.e. training on one
sample at a time). Mini-batch training can be performed in the same way.


Algorithm 1 Our training strategy for one epoch
D is the shuffled training set. B is a sample.
for B in D do
    if B is (x, _) or (x, y) then
        Update w_in: make a gradient step toward λ_in × J_in using B (Eq. 2)
    end if
    if B is (_, y) or (x, y) then
        Update w_out: make a gradient step toward λ_out × J_out using B (Eq. 4)
    end if
    # parallel parameters update
    if B is (x, y) then
        Update w: make a gradient step toward J using B (Eq. 8)
    end if
    Update λ_sup, λ_in and λ_out
end for
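
The sample routing in Algorithm 1 can be sketched in Python as follows. This is only an illustration of the control flow, assuming that a missing input or output is encoded as `None`; the four callbacks (`step_in`, `step_out`, `step_full`, `update_weights`) are hypothetical placeholders for the actual gradient steps and weight updates, which are not reproduced here.

```python
def train_one_epoch(samples, step_in, step_out, step_full, update_weights):
    """Route each sample to the tasks it can feed, following Algorithm 1.

    samples: iterable of (x, y) pairs, already shuffled; a missing part is None.
    step_in(x): hypothetical gradient step on the input reconstruction cost.
    step_out(y): hypothetical gradient step on the output reconstruction cost.
    step_full(x, y): hypothetical gradient step on the full objective.
    update_weights(): hypothetical update of the three importance weights.
    """
    for x, y in samples:
        if x is not None:
            step_in(x)          # sample has features: (x, _) or (x, y)
        if y is not None:
            step_out(y)         # sample has targets: (_, y) or (x, y)
        if x is not None and y is not None:
            step_full(x, y)     # fully labeled sample only
        update_weights()        # executed once per sample, as in the algorithm


# Usage: record which steps fire for each kind of sample.
calls = []
train_one_epoch(
    [(1, None), (None, 2), (3, 4)],
    lambda x: calls.append(("in", x)),
    lambda y: calls.append(("out", y)),
    lambda x, y: calls.append(("full", x, y)),
    lambda: calls.append(("weights",)),
)
```

Passing the steps as callbacks keeps the routing logic independent of any particular deep learning library.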


5 Experiments 

We evaluate our framework on a facial landmark detection problem, which is typically a structured
output problem since the facial landmarks are spatially interdependent. Facial landmarks
are a set of key points on human face images, as shown in Fig. 1. Each key point is defined
by its coordinates (x, y) in the image (x, y ∈ ℝ). The number of landmarks is dataset or
application dependent.

It must be emphasized here that the purpose of our experiments in this paper is not to
outperform the state of the art in facial landmark detection but to show that learning the output
dependencies helps improve the performance of a DNN on that task. Thus, we compare a
model with and without input and output training. [44] use a cascade of neural networks. In their
work, they report the performance of their first global network. Therefore, we use it as a
reference for comparison (both networks have similar architectures), although they use a
larger training dataset.

We first describe the datasets, followed by a description of the evaluation metrics used in
facial landmark problems. Then, we present the general setup of our experiments followed by
two types of experiments: without and with unlabeled data. An open-source implementation of
our MTL deep instantiation is available online¹.

5.1 Datasets 

We have carried out our evaluation over two challenging public datasets for facial landmark 
detection problem: LFPW [4] and HELEN [20]. 

The LFPW dataset consists of 1132 training images and 300 test images taken under unconstrained
conditions (in the wild) with large variations in pose, expression and illumination, and
with partial occlusions (Fig. 1). This makes facial point detection a challenging task on this
dataset. From the initial dataset described in LFPW [4], we use only the 811 training images
and the 224 test images provided by the ibug website². Ground truth annotations of 68 facial
points are provided by [33]. We divide the available training samples into two sets: a validation
set (135 samples) and a training set (676 samples).

The HELEN dataset is similar to the LFPW dataset: the high-resolution images have been taken under
unconstrained conditions and collected from Flickr using text queries. It
contains 2000 images for training and 330 images for testing. Images and face bounding boxes are
provided by the same site as for LFPW. The ground truth annotations are provided by [33].
Examples from the dataset are shown in Fig. 3.



Figure 3: Samples from HELEN [20] dataset. 


All faces are cropped into the same size (50 x 50) and pixels are normalized in [0,1]. The 
facial landmarks are normalized into [-1,1]. 

5.2 Metrics 

In order to evaluate the prediction of the model, we use the standard metrics used in facial 
landmark detection problems. 

¹ https://github.com/sbelharbi/structured-output-ae
² 300 Faces In-the-Wild Challenge: http://ibug.doc.ic.ac.uk/resources/300-W/

The Normalized Root Mean Squared Error (NRMSE) [10] (Eq. 9) is the Euclidean distance
between the predicted shape and the ground truth, normalized by the product of the number
of points in the shape and the inter-ocular distance $D$ (the distance between the eye pupils of the
ground truth):

$$NRMSE(s_p, s_g) = \frac{1}{N \cdot D} \sum_{i=1}^{N} \lVert s_{pi} - s_{gi} \rVert_2, \tag{9}$$

where $s_p$ and $s_g$ are the predicted and the ground truth shapes, respectively. Both shapes have
the same number of points $N$. $D$ is the inter-ocular distance of the shape $s_g$.
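
As a worked illustration of this metric, the sketch below reads the definition literally: it sums the per-landmark Euclidean distances between the two shapes and divides by the number of points times the inter-ocular distance. The function name and the representation of a shape as a list of (x, y) tuples are assumptions made for the example.

```python
import math


def nrmse(pred_shape, true_shape, inter_ocular):
    """Normalized error between two shapes given as lists of (x, y) points.

    inter_ocular is the distance between the eye pupils of the ground truth,
    assumed to be computed by the caller.
    """
    if len(pred_shape) != len(true_shape) or not pred_shape:
        raise ValueError("shapes must be non-empty and have the same length")
    if inter_ocular <= 0:
        raise ValueError("inter-ocular distance must be positive")
    total = sum(math.dist(p, g) for p, g in zip(pred_shape, true_shape))
    return total / (len(pred_shape) * inter_ocular)


# Two landmarks: one predicted exactly, one off by a 3-4-5 triangle.
example_error = nrmse([(0.0, 0.0), (3.0, 4.0)], [(0.0, 0.0), (0.0, 0.0)], 5.0)
```

In this toy case the summed distance is 5, and dividing by 2 points times an inter-ocular distance of 5 gives 0.5.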

Using the NRMSE, we can calculate the Cumulative Distribution Function for a specific
NRMSE value ($CDF_{NRMSE}$) (Eq. 10) over the whole database:

$$CDF_x = \frac{CARD(NRMSE \leq x)}{n}, \tag{10}$$

where $CARD(\cdot)$ is the cardinality of a set and $n$ is the total number of images.

The $CDF_{NRMSE}$ represents the percentage of images with an error less than or equal to the
specified NRMSE value. For example, $CDF_{0.1} = 0.4$ over a test set means that 40% of the
test set images have an error less than or equal to 0.1. A CDF curve can be plotted from
these $CDF_{NRMSE}$ values by varying the value of NRMSE.

These are the usual evaluation criteria used in facial landmark detection problems. To have
more numerical precision in the comparison in our experiments, we calculate the Area Under 
the CDF Curve (AUC), using only the NRMSE range [0,0.5] with a step of 10“^. 
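
The two summary statistics can be sketched in Python as below. Several choices here are assumptions rather than details given in the text: the integration rule (trapezoidal), the normalization of the area by the width of the NRMSE range so that the result lies in [0, 1], and the default step value, which is left as a parameter.

```python
def cdf_at(errors, threshold):
    """Fraction of images whose NRMSE is less than or equal to threshold."""
    if not errors:
        raise ValueError("errors must not be empty")
    return sum(1 for e in errors if e <= threshold) / len(errors)


def auc_of_cdf(errors, upper=0.5, step=0.001):
    """Area under the CDF curve on [0, upper], divided by upper.

    Uses the trapezoidal rule on a regular grid; both the rule and the
    normalization are assumptions of this sketch.
    """
    n_steps = round(upper / step)
    values = [cdf_at(errors, i * step) for i in range(n_steps + 1)]
    area = sum((values[i] + values[i + 1]) / 2.0 * step for i in range(n_steps))
    return area / upper


# Usage on made-up per-image errors.
sample_errors = [0.05, 0.1, 0.2, 0.6]
cdf_01 = cdf_at(sample_errors, 0.1)
```

With this normalization, a model whose errors are all zero scores 1 and a model whose errors all exceed the upper bound scores 0.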

5.3 General training setup 

To implement our framework, we use:
- a DNN with four layers ($K = 4$) for the main task;
- an input AE with one encoding layer ($K_{in} = 1$) and one decoding layer;
- an output AE with one encoding layer and one decoding layer ($K_{out} = 1$).

Referring to Fig. 2, the size of the input
representation $x$ and estimation $\hat{x}$ is 2500 = 50 × 50; the size of the output representation $y$
and estimation $\hat{y}$ is 136 = 68 × 2, given the 68 landmarks in a 2D plane; the dimensions of the
intermediate spaces $\tilde{x}$ and $\tilde{y}$ have been set to 1025 and 64 respectively; finally, the hidden layer
in the $m$ link between $\tilde{x}$ and $\tilde{y}$ is composed of 512 units. The size of each layer has been set
using a validation procedure on the LFPW validation set.
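
The layer sizes quoted above can be laid out as a short Python sketch, which also checks the constraint that the tied layers leave at least one layer for the link between the intermediate spaces. The variable names, and the assumption that each layer is fully connected with a bias vector, are illustrative only.

```python
# Sizes of x, the intermediate input space, the hidden link layer,
# the intermediate output space, and y, as quoted in the text.
LAYER_SIZES = [2500, 1025, 512, 64, 136]

K_IN = 1   # encoding layers of the input auto-encoder tied to the main network
K_OUT = 1  # decoding layers of the output auto-encoder tied to the main network


def dense_param_count(n_in, n_out):
    """Weights plus biases of one fully connected layer (bias is an assumption)."""
    return n_in * n_out + n_out


# Each main-network layer as an (input size, output size) pair.
main_layers = list(zip(LAYER_SIZES[:-1], LAYER_SIZES[1:]))
K = len(main_layers)

shared_with_input_ae = main_layers[:K_IN]      # first layers of the main network
shared_with_output_ae = main_layers[-K_OUT:]   # last layers of the main network
link_layers = main_layers[K_IN:K - K_OUT]      # layers specific to the main task

total_params = sum(dense_param_count(a, b) for a, b in main_layers)
```

Under these assumptions the first layer alone holds most of the parameters, since it maps 2500 pixels to 1025 units.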

Sigmoid activation functions are used everywhere in the main NN and in the two AEs,
except for the last layer of the main NN and the tied last layer of the output AE, which use a
hyperbolic tangent activation function to suit the range [-1, 1] of the output y.

We use the same architecture through all the experiments for the different training
configurations. To distinguish between the multiple configurations we set the following notations:

1. MLP, a DNN for the main task with no concomitant training;

2. MLP + in, a DNN with input AE parallel training;

3. MLP + out, a DNN with output AE parallel training;

4. MLP + in + out, a DNN with both input and output reconstruction secondary tasks.

We recall that the auto-encoders are used only during the training phase. In the test phase, 
they are dropped. Therefore, the final test networks have the same architecture in all the 
different configurations. 

Besides these configurations, we consider the mean shape (the average of the y in the training
data) as a simple predictive model. For each test image, we predict the same mean
shape, estimated over the train set.
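
The mean shape baseline amounts to a coordinate-wise average of the training targets, as the following minimal sketch shows. Shapes are assumed to be flat lists of coordinates of equal length, and the function names are hypothetical.

```python
def mean_shape(train_shapes):
    """Coordinate-wise average of the training target vectors."""
    if not train_shapes:
        raise ValueError("at least one training shape is required")
    n = len(train_shapes)
    dim = len(train_shapes[0])
    return [sum(shape[k] for shape in train_shapes) / n for k in range(dim)]


def predict_mean_shape(train_shapes, n_test):
    """Return the same mean shape for every test image."""
    mean = mean_shape(train_shapes)
    return [list(mean) for _ in range(n_test)]


# Usage with two toy training shapes of two coordinates each.
baseline = predict_mean_shape([[0.0, 1.0], [1.0, 3.0]], n_test=3)
```

Since the prediction ignores the input image, this baseline gives a floor against which the trained networks can be compared.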
To clarify the benefit of our approach, all the configurations start from the same initial
weights, to make sure that the obtained improvement is due to the training algorithm and not to
the random initialization.

For the input reconstruction task, we use a denoising auto-encoder with a corruption level
of 20% for the first hidden layer. For the output reconstruction task, we use a simple
auto-encoder. To avoid overfitting, the auto-encoders are trained using L2 regularization with a
weight decay of 10“^. 

In all the configurations, the update of the parameters of each task (supervised and
unsupervised) is performed using Stochastic Gradient Descent with momentum [37] with a constant
momentum coefficient of 0.9. We use a mini-batch size of 10. The training is performed for 1000
epochs with a learning rate of 10“^. 
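
For readers unfamiliar with the update rule, here is a minimal sketch of one stochastic gradient step with momentum, using the momentum coefficient of 0.9 stated above. The classical (non-Nesterov) form of the update is an assumption, since the text does not say which variant is used, and the learning rate is left as an argument.

```python
def momentum_step(params, grads, velocity, learning_rate, momentum=0.9):
    """Apply one classical-momentum SGD update in place.

    params, grads and velocity are flat lists of floats of equal length.
    """
    for k in range(len(params)):
        # Accumulate a decaying sum of past gradients, then move along it.
        velocity[k] = momentum * velocity[k] - learning_rate * grads[k]
        params[k] += velocity[k]
    return params, velocity


# Two consecutive steps on a single parameter with a constant gradient.
p, v = [1.0], [0.0]
momentum_step(p, [2.0], v, learning_rate=0.1)
after_first = p[0]
momentum_step(p, [2.0], v, learning_rate=0.1)
after_second = p[0]
```

With a constant gradient the second step is larger than the first, which is the acceleration effect momentum is meant to provide.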

In these experiments, we propose to use a simple linear evolution scheme for the importance
weights $\lambda_{sup}$ (supervised task), $\lambda_{in}$ (input task) and $\lambda_{out}$ (output task). We retain the evolution
proposed in [3], presented in Fig. 4.



Figure 4: Linear evolution of the importance weights during training. 
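
A linear evolution of an importance weight over the epochs can be sketched as a simple interpolation. The start and end values used in the example are hypothetical, since the actual values are only given graphically in Fig. 4.

```python
def linear_weight(epoch, start_value, end_value, n_epochs):
    """Linearly interpolate a weight from start_value (epoch 0) to end_value (last epoch)."""
    if n_epochs <= 1:
        return end_value
    clamped = min(max(epoch, 0), n_epochs - 1)
    fraction = clamped / (n_epochs - 1)
    return start_value + fraction * (end_value - start_value)


# Hypothetical schedule: a weight growing from 0 to 1 over 1001 epochs.
halfway = linear_weight(500, 0.0, 1.0, 1001)
```

One such function per task, evaluated once per epoch, is enough to supply the three weights of the objective.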

The hyper-parameters (learning rate, batch size, momentum coefficient, weight decay, the 
importance weights) have been optimized on the LFPW validation set. We apply the same 
optimized hyper-parameters for HELEN dataset. 

Using these configurations, we perform two types of experiments: with and without unlabeled
data. We present the obtained results in the next sections.

5.3.1 Experiments with fully labeled data 

In this setup, we use the provided labeled data from each set in a classical way. For LFPW set, 
we use the 676 available samples for training and 135 samples for validation. For HELEN set, 
we use 1800 samples for training and 200 samples for validation. 

In order to evaluate the different configurations, we first calculate the Mean Squared Error
(MSE) of the best models found using the validation set during training. The first column group (no
unlabeled data) of Tables 1 and 2 shows the MSE over the train and valid sets of the LFPW and HELEN
datasets, respectively. Compared to an MLP alone, adding the input training of the first hidden
layer slightly reduces the train and validation errors on both datasets. Training the output layer
also reduces the train and validation errors, by a larger factor. Combining the input
training of the first hidden layer and the output training of the last layer gives the best performance.
We plot the tracked MSE over the train and valid sets of the HELEN dataset in Fig. 7(a) and 7(b).
One can see that the input training slightly reduces the validation MSE. The output training
has a major impact on the training speed and the generalization of the model, which suggests
that output training is useful in the case of structured output problems. Combining the input
and the output training improves the generalization even more. Similar behavior was found on
the LFPW dataset.

Next, we evaluate each configuration over the test set of each dataset using the
$CDF_{0.1}$ metric. The results are reported in the first column group of Tables 3 and 4 for the LFPW and HELEN
datasets, respectively. Similarly to the results previously found over the train and validation
sets, one can see that the joint training (supervised, input, output) outperforms all the other
configurations in terms of $CDF_{0.1}$ and AUC. The CDF curves in Fig. 8 also confirm this result.
Compared to the global DNN in [44] over the LFPW test set, our jointly trained MLP performs better
([44]: $CDF_{0.1}$ = 65%, ours: $CDF_{0.1}$ = 69.64%), despite the fact that their model was trained
using a larger supervised dataset (a combination of multiple supervised datasets besides LFPW).

An illustrative result of our method is presented in Fig. 5 and 6 for LFPW and HELEN using
an MLP and an MLP with input and output training.





Figure 5: Examples of prediction on LFPW test set. For visualizing errors, red segments
have been drawn between ground truth and predicted landmark. Top row: MLP. Bottom row:
MLP+in+out. (no unlabeled data)


Table 1: MSE over LFPW: train and valid sets, at the end of training with and without
unlabeled data.

                   No unlabeled data               With unlabeled data
                   MSE train      MSE valid        MSE train      MSE valid
Mean shape         7.74 × 10^-3   8.07 × 10^-3     7.78 × 10^-3   8.14 × 10^-3
MLP                3.96 × 10^-3   4.28 × 10^-3     -              -
MLP + in           3.64 × 10^-3   3.80 × 10^-3     1.44 × 10^-3   2.62 × 10^-3
MLP + out          2.31 × 10^-3   2.99 × 10^-3     1.51 × 10^-3   2.79 × 10^-3
MLP + in + out     2.12 × 10^-3   2.56 × 10^-3     1.10 × 10^-3   2.23 × 10^-3

Figure 6: Examples of prediction on HELEN test set. Top row: MLP. Bottom row:
MLP+in+out. (no unlabeled data)


Table 2: MSE over HELEN: train and valid sets, at the end of training with and without data
augmentation.

                   Fully labeled data only         Adding unlabeled or label-only data
                   MSE train      MSE valid        MSE train      MSE valid
Mean shape         7.59 × 10^-3   6.95 × 10^-3     7.60 × 10^-3   0.95 × 10^-3
MLP                3.39 × 10^-3   3.67 × 10^-3     -              -
MLP + in           3.28 × 10^-3   3.42 × 10^-3     2.31 × 10^-3   2.81 × 10^-3
MLP + out          2.48 × 10^-3   2.90 × 10^-3     2.00 × 10^-3   2.74 × 10^-3
MLP + in + out     2.34 × 10^-3   2.53 × 10^-3     1.92 × 10^-3   2.40 × 10^-3


Table 3: AUC and $CDF_{0.1}$ performance over LFPW test dataset with and without unlabeled
data.

                   Fully labeled data only     Adding unlabeled or label-only data
                   AUC        CDF_0.1          AUC        CDF_0.1
Mean shape         68.78%     30.80%           77.81%     22.33%
MLP                76.34%     46.87%           -          -
MLP + in           77.13%     54.46%           80.78%     67.85%
MLP + out          80.93%     66.51%           81.77%     67.85%
MLP + in + out     81.51%     69.64%           82.48%     71.87%


Table 4: AUC and CDF0.1 performance over HELEN test dataset with and without unlabeled
data.

                  Fully labeled data only    Adding unlabeled or label-only data
                  AUC        CDF0.1          AUC        CDF0.1
Mean shape        64.60%     23.63%          64.76%     23.23%
MLP               76.26%     52.72%          -          -
MLP + in          77.08%     54.84%          79.25%     63.33%
MLP + out         79.63%     66.60%          80.48%     65.15%
MLP + in + out    80.40%     66.66%          81.27%     71.51%
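
As a rough illustration of how the two test metrics reported in Tab. 3 and 4 can be obtained from per-image errors, the sketch below assumes that CDF0.1 is the fraction of test images whose NRMSE is at most 0.1, and that the AUC is the area under the CDF curve over [0, max_nrmse], normalized by max_nrmse. The function names, the integration limit `max_nrmse=0.5`, the grid size and the toy errors are assumptions made for this example; they are not taken from the experiments.

```python
import numpy as np


def cdf_at(nrmse, threshold):
    """Fraction of samples whose NRMSE is <= threshold."""
    nrmse = np.asarray(nrmse, dtype=float)
    return float(np.mean(nrmse <= threshold))


def auc_of_cdf(nrmse, max_nrmse=0.5, n_points=1001):
    """Area under the empirical CDF on [0, max_nrmse], normalized to [0, 1].

    max_nrmse and n_points are illustrative choices (assumptions).
    """
    thresholds = np.linspace(0.0, max_nrmse, n_points)
    cdf = np.array([cdf_at(nrmse, t) for t in thresholds])
    # Trapezoidal rule, written out explicitly.
    area = np.sum((cdf[1:] + cdf[:-1]) * np.diff(thresholds)) / 2.0
    return float(area / max_nrmse)


# Toy per-image errors (made up for the example).
errors = [0.04, 0.08, 0.12, 0.20]
cdf_01 = cdf_at(errors, 0.1)  # two of the four errors are below 0.1
auc = auc_of_cdf(errors)
```

Under these assumptions, a higher CDF0.1 means that more test faces are predicted within the 0.1 error tolerance, while the AUC summarizes the whole error distribution in a single number.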


5.3.2 Data augmentation using unlabeled data or label-only data 

In this section, we evaluate our approach when adding unlabeled data (input and output).
Unlabeled data (i.e. face images without the landmark annotations) are abundant and can easily
be found, for example in other datasets or on the Internet, which makes their use practical and
realistic. In our case, we use face images from another dataset.

Figure 7: MSE during training epochs over HELEN train (a) and valid (b) sets using different
training setups for the MLP.

On the other hand, label-only data (i.e. the landmark annotations without the face images) are
more difficult to obtain because the annotations are usually produced from the face images. One
way to obtain accurate and realistic facial landmarks without face images is to use a 3D face
model as a generator. We use an easier way to obtain facial landmark annotations by taking
them from another dataset.

[Figure 8, panel (a): cumulative distribution function (CDF) of NRMSE over LFPW test set.
Panel (b): cumulative distribution function (CDF) of NRMSE over HELEN test set.
Axis label: NRMSE. Legend of panel (b):
mean shape, CDF(0.1) = 23.636%, AUC = 64.609%
MLP, CDF(0.1) = 52.727%, AUC = 76.261%
MLP + in, CDF(0.1) = 54.848%, AUC = 77.082%
MLP + out, CDF(0.1) = 66.061%, AUC = 79.633%
MLP + in + out, CDF(0.1) = 66.667%, AUC = 80.408%]

Figure 8: CDF curves of different configurations on: (a) LFPW, (b) HELEN.

In this experiment, in order to add unlabeled data to the LFPW dataset, we take all the face
images of the HELEN dataset (train, valid and test), and vice versa for the HELEN dataset, for
which all LFPW face images are taken as unlabeled data. The same procedure is followed for the
label-only data using the facial landmark annotations. We summarize the size of each train set
in Tab. 5.


Table 5: Size of augmented LFPW and HELEN train sets.

Train set / size of    Supervised data    Unsupervised input x    Unsupervised output y
LFPW                   676                2330                    2330
HELEN                  1800               1035                    1035


We use the same validation sets as in Sec. 5.3.1 in order to have a fair comparison. The MSE
values are presented in the right-hand columns of Tab. 1 and 2 for the LFPW and HELEN datasets.
One can see that adding unlabeled data decreases the MSE over the train and validation sets.
As in the fully labeled setting, training the input and output tasks together gives the best
results. These results carry over to CDF0.1 and AUC on the test sets (Tab. 3, 4). All these
results suggest that adding unlabeled input and output data can improve the generalization of
our framework and the training speed.
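
To make the gain concrete, the validation MSE values of Tab. 1 (LFPW) can be turned into relative reductions with respect to the plain MLP. The short script below is only a reading aid: the numbers are copied from Tab. 1, and the helper name `relative_reduction` is introduced for this example.

```python
# Validation MSE on LFPW, in units of 1e-3 (copied from Tab. 1).
mlp_valid = 4.28
in_out_valid = 2.56            # MLP + in + out, no unlabeled data
in_out_valid_unlabeled = 2.23  # MLP + in + out, with unlabeled data


def relative_reduction(baseline, value):
    """Relative decrease of `value` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - value) / baseline


gain_labeled = relative_reduction(mlp_valid, in_out_valid)
gain_unlabeled = relative_reduction(mlp_valid, in_out_valid_unlabeled)
print(f"{gain_labeled:.1f}% {gain_unlabeled:.1f}%")
```

With these values, the validation MSE drops by roughly 40% relative to the MLP without unlabeled data, and by roughly 48% when the unlabeled and label-only data are added.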


6 Conclusion and Future Work 

In this paper, we tackled structured output prediction as a representation learning
problem. We proposed a generic multi-task training framework as a regularization scheme
for structured output prediction models. It has been instantiated through a deep neural network
model which learns the input and output distributions using auto-encoders while learning
the supervised task x → y. Moreover, we explored the possibility of using the output labels y
without their corresponding input data x, which further improved the generalization.
Using a parallel scheme allows an interaction between the main supervised task and the
unsupervised output task, which helped to prevent the latter from overfitting.
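
To illustrate the structure of such a multi-task objective, the following sketch combines a supervised loss with two reconstruction losses through importance weights. It is a minimal illustration and not the released implementation: the use of the MSE for all three tasks, the function and argument names, and the weight values are assumptions made for this example.

```python
import numpy as np


def mse(a, b):
    """Mean squared error between two arrays of the same shape."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.mean((a - b) ** 2))


def multitask_loss(y_true, y_pred, x, x_rec, y, y_rec, weights):
    """Weighted sum of the supervised loss and the two reconstruction losses.

    weights = (w_sup, w_in, w_out) are the importance weights of the tasks.
    x / x_rec and y / y_rec may come from unlabeled and label-only data.
    """
    w_sup, w_in, w_out = weights
    return (w_sup * mse(y_true, y_pred)
            + w_in * mse(x, x_rec)
            + w_out * mse(y, y_rec))


# Toy values (made up): each individual loss is easy to check by hand.
loss = multitask_loss(
    y_true=[0.0, 0.0], y_pred=[1.0, 1.0],  # supervised MSE = 1.0
    x=[0.0, 0.0], x_rec=[2.0, 2.0],        # input reconstruction MSE = 4.0
    y=[0.0, 0.0], y_rec=[0.0, 2.0],        # output reconstruction MSE = 2.0
    weights=(0.5, 0.25, 0.25),
)
```

In such a setup, changing `weights` over the course of training is what shifts the emphasis between the reconstruction tasks and the supervised task.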

We evaluated our training method on a facial landmark detection task over two public
datasets. The results showed that our proposed regularization scheme improves the
generalization of neural network models and speeds up their training. We believe that our
approach provides an alternative for training deep architectures for structured output
prediction, since it allows the use of unlabeled input data and label-only output data.

As future work, we plan to adapt the importance weights of the tasks automatically.
To that end, we can consider indicators based on the training and validation errors,
instead of the learning epochs, to better guide their evolution. Furthermore, one may consider
other kinds of models instead of simple auto-encoders in order to learn the output distribution.
More specifically, generative models such as variational and adversarial auto-encoders [24] could
be explored.